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Abstract 



In this paper we present an extension of population-based Markov chain Monte Carlo (MCMC) to 

the trans-dimensional case. One of the main challenges in MCMC-based inference is that of simulating 

from high and trans-dimensional target measures. In such cases, MCMC methods may not adequately 

' traverse the support of the target; the simulation results will be unreliable. We develop population 

^ , methods to deal with such problems, and give a result proving the uniform ergodicity of these population 

4— > . 

C/2 . algorithms, under mild assumptions. This result is used to demonstrate the superiority, in terms of 

convergence rate, of a population transition kernel over a reversible jump sampler for a Bayesian variable 
^ ' selection problem. We also give an example of a population algorithm for a Bayesian multivariate 

^ : 

QQ , mixture model with an unknown number of components. This is applied to gene expression data 

of 1000 data points in six dimensions and it is demonstrated that our algorithm out performs some 
competing Markov chain samplers. 
' Some key words: Population Monte Carlo, Uniform ergodicity, Bayesian variable selection. Mixture 

o ■ 



models, Gene expression data 



1 Introduction 

The Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings, 1970) and its adaptation to the trans- 
dimensional case (Green, 1995) has provided a method to simulate from complex probability measures in 
high dimensions. This has facilitated the application of (particularly Bayesian) complicated statistical 
models which could not otherwise be fitted. 

We consider the problem of simulating from a probability measure T:{x)X{dx) defined on measurable 
space {E,£) (where A is a cr— finite measure on £), with vr known pointwise, at least up to a normalizing 
constant. This is achieved (in the context of MCMC) by simulating an ergodic Markov chain {A'„}„>o, 
with kernel K : E x £ ^ [0, 1], of stationary distribution vr. The Markov chain can be used, for example, 
to estimate expectations of tt— integrable functions. In this article we focus on reversible jump Markov 
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chain Monte Carlo (RJMCMC), where the state space is a union of subspaces of differing dimension, that 
is, where E = \J^^^ {{k} x X; C N, C M'^. 

In statistical terms, tt will often be a Bayesian posterior distribution which is normally known pointwise, 
up to a normalizing constant. For example, applications of RJMCMC include classification and regression 
(Denison et al., 2002) and mixture modelling (Richardson and Green, 1997), with particular emphasis on 
model determination. However, in many examples, the naive (vanilla) RJMCMC sampler can fail to move 
around the support of the target in a feasible computation time; sec Brooks ct al. (2003) for examples. 

To deal with these problems, several MCMC approaches have been suggested, including auxiliary vari- 
able methods (Brooks et al., 2003) and tempered transitions (Jennison ct al., 2003): Green (2003b) provides 
a recent review. One approach used for difficult sampling problems in fixed dimensional spaces, that have 
not been widely used in the variable dimension case, relies on population-based MCMC methods (Liang 
& Wong, 2001; Liu, 2001). It is straightforward, conceptually, to extend this approach to the variable di- 
mension case. Such an extension is the focus of this paper; we study the potential theoretical and practical 
advantages of population approaches over standard MCMC methods. We remark that methods other than 
MCMC may be used for difficult simulation problems, such as sequential Monte Carlo (e.g. Del Moral et 
al., (2006)), but such methods are not the focus of this paper. 

1.1 Population-based Markov chain Monte Carlo 

Population-based MCMC operates by embedding the target into a sequence of related probability measures 
{"■jjieTiv I Tf-ZV := {1, • • • , and simulating the N parallel chains (the population), as in parallel tempering 
(Geyer, 1991; Hukushima & Nemoto, 1996). In addition, the chains are allowed to interact via various 
crossover moves; a summary is given in Section 3 - see Liu (2001) for an extensive review. 

The main advantage of population-based simulation over other methods is the fact that the population 
simultaneously represents many properties of the target distribution. This is particularly useful in trans- 
dimensional simulation, where it can be difficult to construct efficient dimension-changing proposals. Green 
(2003b) notes that some MCMC methods retain information about which states have been visited, for 
example, the product space approach (Carlin & Chib, 1995; Godsill, 2001), whilst standard reversible 
jump MCMC does not retain this information. There are advantages to both approaches. The first 
approach can provide an improved mode-jumping property; that is, the ability to jump a large number 
of dimensions that would take a substantial time under the standard approach. The second approach has 
greater capacity to discover new states that are consistent with the target. It is clear that an algorithm 
which can combine these properties is likely to provide an improvement over both methods; the objective 
of this paper is to construct such an algorithm, and to investigate its theoretical properties. 
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1.2 Contribution and Structure of Paper 

Theoretical aspects of population-based MCMC have been rarely considered (see, however, Madras and 
Zheng (2003)), and therefore the improvements, if any, offered by population methods over standard 
MCMC have not been fully established. In this article we present a result which ensures (under fairly 
mild conditions, including that the density tt is upper bounded) the uniform ergodicity of a population 
transition kernel, and allows the construction of population algorithms which are preferable, in theory, to 
their single chain counterparts. The result can help to illuminate, for small N, why population algorithms 
can work well in practice. 

We also demonstrate that a particular population-based kernel - the exchange kernel (see Sections 13.41 
and lS.ip which is fundamental to swapping information in population MCMC - can improve convergence 
properties (over parallel MCMC, in which independent identical Markov chains are run) under strong 
assumptions; this has not, to our knowledge, been established previously. Our results are substantiated 
with an example in Bayesian variable selection, and illustrate the use of population MCMC methods in a 
mixture problem. 

Population methods naturally accommodate multiple MCMC strategies which improve the ability of 
the sampler to mix across the state space. In our main example we show how to combine the methods of 
parallel chains (Geyer, 1991), tempering (e.g. Geyer & Thompson, 1995), snooker algorithms (Gilks, et. al., 

1994) , constrained sampling (e.g. the unpublished technical report of Atachde & Liu (2004)) and delayed 
rejection (Green & Mira, 2001). We believe that although such methods may not perform adequately 
(individually) together they can provide a superior MCMC sampler. 

This paper is organised as follows. In Section 2 we provide an illustrative example related to the 
clustering of gene expression data via Bayesian multivariate mixture models. In Section 3 we introduce 
population-based reversible jump. In Section 4 we give some theoretical results that can indicate why 
population methods can perform well. In Section 5 we present population MCMC moves for the mixture 
model example. In Section 6 we provide a comparison of vanilla, simulated tempering (Geyer & Thompson, 

1995) and population samplers for the mixture model example. In Section 7 we conclude with a discussion, 
detailing extensions to our approach. 

1.3 Notation 

The notation and mathematical objects that are adopted in the paper are summarized here. 

A measurable space is denoted {E,£): throughout this article f is a countably generated cr— algebra. 
The product cr— algebra is written £^ :— £ (S> ■ ■ ■ <E) £ (product A^— times). We use Sx{dx') to represent 
Dirac measure 
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For two probability measures, Ai and A2, on £, the total variation distance is written ||Ai — A2||Ty := 
sup^gg 1-^1 (^) ~ -^2(^)1- The Dobrushin coefficient (Dobrushin, 1956) of a Markov kernel K is denoted 
P{K) := sup(3. j^)£^2 ||^(a;, •) — K{y-)\\Tv- The composition of two Markov kernels, K and P is written 
K o P[x,dy) := K{x,du)P{u,dy), except ii K = P where K"^ is used. Particular Markov kernels of 
interest are the product and mixture kernels that combine component kernels Ki in a multiplicative and 
additive fashion, which will be denoted 

i i 

for mixture weights t^. Given a probability measure A and Markov kernel K, the standard notation 
XK{dy) :— \{dx)K{x, dy) is adopted. For a bounded measurable function /, the oscillations are written 
osc(/) = SMV{x,v)eE^ \f{x) - f{y)\. 

Finally, we use the vector notation Xi:n ■= {xi, . . . , Xn) and denote a vector with its i*^ element missing 
as X-i and with only its i*^ and elements as ajjj. Also, Ti-p := {I,. . .,p\ for any I < p, {l,p) e Z^. 

2 An Illustrative Example: Finite Mixture Modelling 

Mixture models are typically used to model heterogeneous data, or as a simple means of density estimation; 
see McLachlan & Peel (2000) for an overview. Bayesian analysis using mixtures with an unknown number 
of components has only fairly recently been implemented (see Richardson & Green (1997) and in the 
multivariate context Stephens (2000) and Dellaportas & Papageorgiou (2006)). 

In this section, we describe the finite mixture model adopted and a motivating example, in which 
standard MCMC methods do not perform adequately. 

2.1 Model 

Let yi;n denote observed data that lie on support yi € Y C W, i = 1, . . . ,n. We assume that the yi are 
i.i.d with density: 

k 

p{yi\Vi:k,w-i:k) = Ywjf{yi;r]j) 

where r/uk are component specific parameters, the weights w^k are such that Y^'j^i wj = 1, Wj > \/ j, 
p denotes an arbitrary probability density function and / is the component density. For our model, we 
restrict ourselves to the case of multivariate t, 7^(/i, A, s), where (/x. A) are the location and covariance 
parameters and s is the degrees of freedom. 

In specifying the prior distributions, we follow Stephens (2000); the component mean vectors are taken to 
be independent J\fr{£,, k^^) (multivariate normal distribution), and the A^ are independently I>Vr(2a, 2\1/), 
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where TWr(-, ■) is the inverse Wishart distribution. The following hierarchical structure is adopted: 

* ~ Wr{2g, {2h)~^) where Wr{-, •) is the Wishart distribution 
wi:k-i\k ~ 'D{5) where !?(•) is the symmetric Dirichlet distribution 

k ^ ^{i,. -,femax} where Us is the discrete uniform distribution on countable set S. 

When using the multivariate i-distribution, the degrees of freedom are assumed known. Thus, our prior is: 



P{wl:k-l,IJ'l:k,^l:k,'^,k) = 



k 



np(M,)p(A,i*) 



p{^)p{wi:k-i\k)p{k). 



2.2 Data Processing and Prior Distributions 



For our example we consider the problem of clustering gene expression data ( e.g. Heard ct al. (2006)). 
The data consist of the relative level of gene expression - a measure of genetic activity - for n = 4221 genes 
of the parasite Plasmodium measured at r = 46 time points across a 48 hour portion of the parasite life 
cycle. The data are discussed in detail in Bozdech et al. (2003). Finding meaningful subgroupings of the 
data is an important task for biological investigators. 

Even with modern computing power, applying a fully Bayesian analysis to such data is not practical. 
Therefore, the data is preprocessed the to reduce the n x r-dimensional data to I x q dimensions. We 
achieve this by adopting a K-means partitioning approach to reduce n to Z, and then principal components 
to reduce r to q. We selected I = 1000 and q = &. 

Prior hyperparameters are set in a similar way to Stephens (2000); we set ^ to be the midpoint of the 
observed data in its corresponding dimension, and k is taken to be diag(l/iii, . . . , 1/ii^) where Rd is the 
range of the data in dimension d G Tg. Additionally g = q/2, 6 = 1 and a = a' + {q + l)/2, where a' = 3. 
Finally h is diag(100(7/(2ai?^), . . . , 100q/{2aR'^)). For illustration a i-distribution with four degrees of 
freedom is used as the component density in our mixture model and set fcmax = 20. 

2.3 Performance of Vanilla Sampler 

A vanilla reversible jump sampler (outlined in Appendix 1) was implemented for the data above. We ran 
the reversible jump algorithm for 250000 sweeps from two different starting points, which had different 
initial k. The C program was run on a Pentium 4, 3 Ghz machine and took approximately three hours. 
We observed extremely poor mixing, with all variable dimension acceptance rates below 1%. It can be 
seen that there appears to be support for k G ¥3:5 components, but also for k G Tg^i. The main problem 
is that the vanilla sampler cannot jump between these two modes. 

The poor performance of the vanilla algorithm can be partly attributed to the difference in dimen- 
sionality between different mixture models. To jump from a fc to a A; + 1 component mixture model, we 
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need to draw 1 + q + q{q + l)/2 = 28 random variables, so we will need proposals/jump functions that 
are more tailored than those used in the vanilla sampler. In addition, if between-model jumps are made 
infrequently, a within-model chain spends a long time in a high density region of the state space specific to 
that model that is consistent with the data. This within-model adaptation renders a jump between models 
even less likely, as high density regions in different models do not necessarily correspond to each other in 
a straightforward fashion. This heuristic argument applies to local moves, and whilst more global moves 
might be constructed (as noted by Green (2003a), they are more likely to produce better mixing than the 
local moves attempted here), dimension matching dictates that this will be difhcult to achieve efficiently. 

2.4 Alternative Algorithms 

Possible MCMC methods that might be used to deal with the problems encountered here may be the 
auxiliary variable method of Brooks et al. (2003), but this will require a reasonable movement of the chain 
in the first place. Constructing proposal distributions by creating an approximation of the target in each 
dimension using fixed dimensional MCMC (Hastie, 2005 (University of Bristol PhD thesis)) is complicated 
by the label-switching problem (see Jasra et al. (2005) for a discussion). Delayed rejection (Green & 
Mira, 2001) and tempered transitions (Jennison et al., 2003) often do not provide a general solution to the 
problems highlighted by this example. For the former method, insufficient information is learnt at the first 
stage rejection to provide a significantly improved second stage proposal, whereas for the latter, it often 
takes a large number of intermediate simulations (e.g. 100) to provide a reasonable proposal, but even so, 
the performance gain is not always substantial. 



3 Population-Based Reversible Jump 

We now consider population-based reversible jump algorithms. First we give details of the population 
MCMC method, and then study the theoretical properties of the algorithm, in particular its uniform 
ergodicity. 

3.1 The Population MCMC Method 

Consider a sequence of probability measures {T^i}ieTNJ ^^ch assumed to admit a density (also denoted tTj) 
with respect to A on {E,£). Denote the support of the i*^ density Ei = {{xk,k) G E : TTi{xk,k) > 0}, 
i G Tjv. Define probability measure Tr*{d{xi;ki,ki, . . . ,xi:k,^,kN)) on {E^,£^) as: 

N 



TT*{d{xi;ki,ki, . . . ,Xi;kN,kN)) 



Yl^ T^tixi-.ki , h)X{d{xi.,ki , h)) 
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Our objective is to now construct an ergodic Markov kernel K : x £^ [0, 1] with stationary 
distribution tt*. 

3.2 The Structure of the Population 

The population approach proceeds as follows; we generate N parallel (variable dimension) chains in order 
to explore the target correctly. For the remainder of the paper tti = tt and the sequence of densities are 
taken tt^ cx tt'-*, 1 = Ci > ■ ■ ■ > Cn > 0, where {CijieTjv &re inverse temperature parameters; the collection 
{Ci} is referred to as the temperature ladder. This is the approach in Liang & Wong (2001), but other 
settings include taking Q — 1 for each i; see Del Moral et al. (2006) for further discussion. We seek to use 
the extra information contained in N chains at different temperatures to allow large moves in dimension 
of the chain of interest as well as allowing improved performance in more local moves (within and between 
dimensions). 

One of the main problems of parallel tempering (Geyer, 1991; Hukushima & Nemoto, 1996) is that only 
minimal interactions between the chains are allowed. Our approach differs as we will allow K to include 
moves which use the entire population, other than merely the exchange move (see Section l3.4p . Thus we 
seek a more population-based approach to justify the increased cost in computation. 

We now investigate some theoretical aspects of population algorithms. As our results are not confined 
to the trans-dimensional case, we drop the k from the notation. That is, is a general state space with 
associated countably generated cr— algebra £. 

3.3 Some Theory for Markov chains 

We will concentrate on the concept of uniform ergodicity (see Roberts & Rosenthal (2004) and the references 
therein). Our objective is to utilize the following property of uniformly ergodic Markov kernels K; the 
total variation distance between the n— step kernel and its stationary measure tt, (||iir"(a;, •) — 7r(-)||Ty), is 
bounded above by: 

R = (l-e)^^J 

where uq and e are parameters in the following minorization condition (for C ~ E): 3e > such that for 
V some non-trivial probability measure and integer no > and Vx G C, A G f : 

^•""(2;,^) > €v{A). (3.1) 

If C G £ is such that (|3.ip is satisfied, then we term it (no, e, j/)— small. 

It follows that, to compare convergence speed (note that spectral gap techniques - see, for example, 
Diaconis & Saloff Coste (1993) - can also be considered; see Section [175)) of two uniformly ergodic Markov 
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kernels, we can compute the pair (no, e); if one of the kernels has a substantially larger e (and smaller no), 
then we might expect it to converge more quickly to the target of interest. 

3.4 Main Result 

We now demonstrate that population MCMC approaches can be defined, using specially constructed com- 
ponent kernels, so that they are uniformly ergodic. For the following Theorem we denote a mutation 
(Markov transition) kernel as Km and an exchange kernel Ke- A mutation move is a Metropolis-Hastings 
(MH) kernel which attempts to change a single member of the population, dependent only on the current 
state of the chain for that member. An exchange move is a Metropolis-Hastings kernel which proposes 
to swap the current states of two different members of the population, mathematically it is defined as, 
i^rf^E^x £2 ^[0,1]: 

K^i\x,,M,i) = min{l,^^4:^^|<5.,(dx:)4.(d^;) + 
1 -min<^ 1 ^ ^ ^ ' 



. X . w 5^Mx'i)^xi{dx'i). 

■Ki{Xi)'Kl{xl) J J 

For simplicity, we assume that Ei = E Vi gT^. We now give our main theoretical result; 

Theorem 1. Let K be a Markov kernel defined on a measurable space {E^ ,£^) such that: 
K{xi;N, dx'i,j^) = (^tKm + (1 - t)Ke^ {xi:N, dx[, 



N 



KM{xl:N,dx[,j^) = Yl^ii^i^dx'i) 



1=1 

AT-l AT 



KE{xi:N,dx[,N) = ^ ^il^E\^i,l'dXij)S^_^,^^{dx_^i^i^) 
i=l l=i+l 



with 



N-l N 

T,eu G (0,1) = 1 

i=l l=i+l 

and for each i G Tn Ki is an aperiodic, X-irreducible Markov kernel with invariant measure Tri{x)X{dx). 
Suppose that Kj* is uniformly ergodic for one j* e Tjv and for each i ^ j* 3gi e (0, oo) such that 
Tri{x) < QiTTj* {x) V x G E. Then K is uniformly ergodic. 

Proof. See Appendix 2 for the proof. □ 

Remark 1. The assumption of uniform ergodicity for at least one Kj* is not overly restrictive. In many 
applications, for example, Bayesian analyses with a proper prior and a bounded likelihood, a proposal 
under an independence kernel with proposal density where TT{x)/q{x) < g V x € E (see Tierney (1994), 
Mengersen & Tweedie (1996)), can be found - for example, where q is the prior density - which ensures 
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uniform ergodicity at the cost of being a poor proposal for tt. However, if applied to a related distribution 
in the population, this proposal may perform quite well. The assumption TTi{x) < ftTr^. (x) \f x £ E is quite 
reasonable and would apply in the framework described here, when tt is bounded, for the case j* = N. 
Crucially, for small N, we can establish that the fast rate of convergence for one of the kernels is propagated 
through the population (see Section [321) ■ We note that the result holds for the case where Km is a mixture 
of kernels, but have omitted the details for brevity. 

Remark 2. The Theorem shows, in simple cases, where we can design uniformly ergodic chains for the 
target tt (which are likely to perform well in practice), how to compare population and single chain ap- 
proaches. In other words we can construct a population sampler which has a faster rate of convergence 
to TT* (and hence tt) which justifies the increased cost in computation. Additionally, we can investigate 
the population kernels, such as K, which are likely to provide good mixing for more complex examples. It 
should be noted that, due to the limitations of investigating the minorization conditions, it is difficult to 
use this result when N is large; we discuss this further in Section 11751 



3.5 Impact of the Exchange kernel 

In the following result we show that the exchange kernel, which is reducible, can indeed improve the rate 
of convergence relative to a parallel MCMC algorithm (no interaction between the chains). To study this 
phenomenon, we introduce the following mixing condition (M) (e.g. Del Moral (2004)) for Markov kernel 
K : X £^ ^ [0,1]: 3 e > such that V {x, y) e E^'^ 

K{x,-)>eK{y,-). (M) 

Also, introduce the set = {(i, I) : i,l £ Tjv, i ^ l\. 

Proposition 2. Assume Km satisfies [M\) . Then for any initial distribution rj and n > 1; 

\\r]{KMoKEr -^*\\tv < [2(l-a)(l-e)]"||?7-^*||Ty 
where a = Y.{i,i)eT'j^ - inf(a;i,a;,)e-E2 Pi,i{xi,xi)] and pi^i{xi,xi) = min |l, 
Proof. See Appendix 2 for the proof. □ 

Remark 1. The result provides a sufficient condition to improve upon parallel MCMC, in that it gives a 
tighter bound on the total variation (which is, under (M), (1 — e)"^]]?? — 7r*||Tv - the factor 3/2 is used 
to make a fair comparison, in terms of CPU time - for Markov kernel Km)'- the exchange probabilities 
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are lower-bounded by 1 — ^ ^^'^ . Essentially, it is crucial to select a good sequence of densities and use 
population moves to improve convergence. As noted by a referee, this indicates that if e is small, that is K]\{ 
is poorly mixing, the exchange step can help to improve the convergence rate. In addition, if V(a;, y) G E'^, 
e we have < < gu then the lower bound may be achieved if > 1 ~ which 

suggests that we would require the densities to be similar, to improve convergence rates. 

Remark 2. The result is only of use on compact spaces (such as for the finite state space variable selection 
example of section |4] below) , where we are able to lower bound acceptance probabilities and apply the 
mixing condition (M). However, even on compact spaces, (M) may be difficult to verify for MH kernels due 
to the rejection probability. We note that this can be circumvented by iterating the kernel. 

4 Example: Bayesian Variable Selection 

We have established the potential theoretical benefits that population MCMC methods offer. A specific 
example is now studied. 

4.1 Model and Data 

Consider the statistical model: 

fcmax 

Ui = 70 + ^ "^jlj^ij + 
J = l 

with -117,; i.i.d A/'(0, cr^), "dj G {0,1}, and 7j G M. If we consider the conjugate prior specification: 
p(7o:fc|f, fc) — A4+i('7i, (j'^V), p{<y'^) — 1Q{a, b) (where TQ{-, ■) is the inverse Gamma distribution) and 

(where Sk — {"^i-.k^^^ £ {0, 1]'^'^^^ : X^j^o" ^ ^}) then we can integrate out the parameters (cr, 7o:fc) and 
sample from a distribution on a finite state space. 

We generated 100 data points from a linear model, with fc^ax — 8 (i.e. 256 states). The posterior 
probability of a null model was 0.55 and 0.33 for the saturated model. This is a typical (but simplified) 
situation for which a standard MCMC sampler would fail to move around the state space easily. 

4.2 Comparison of Convergence Rates 

To sample from the posterior distribution we use an MCMC algorithm detailed in Denison ct al. (2002) 
page 53 (with modification to the variable selection case). 
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Since the state space is finite, it is clear that an ergodic Markov chain with appropriate stationary 
distribution is miiformly ergodic. To construct an appropriate v in (|3.ip (i.e. that leads to a large e) we 
used the approach discussed in Chapter 6 of Robert & Casella (2004) (for example). Let Kf^ G E) 
denote the n step transition probability and suppose that infjif"} > for some j, then \f j 



> ini{Kr,} = ev, 

say, where 

V, = and e = V inf {i^'}}. 

leE ' 

For the algorithm discussed above, we found that the bound on the rate of convergence was reasonably 
similar for uq = 1000 to uq = 5000; we focus upon the pair (1000,3.63 x 10^'^). To make the analysis 
computationally comparable to the population algorithm described below, we let 50 applications of this 
kernel be equivalent to a single step (i.e. this new kernel has (20, 3.63 x lO^'^) as the (no, e) pair). 
For a population sampler, suppose we take a single auxiliary distribution: 



7r2 



with C = 0.01 and L the likelihood function. The choice of 0.01 is used for illustration, and is adopted 
to demonstrate the impact of the usage of a related, but easier to sample, distribution in the population. 
We concentrate upon a kernel which updates both chains via the MCMC algorithm mentioned above for 
10 sweeps, followed by an exchange, then another 10 sweeps (which corresponds to the kernel we sample 
from, that is, it is a single time step). 

It can be shown that the {uq, e) pair for the population sampler is (1, 6.01 x 10^^). This was computed 
by finding the e and v in the minorization condition for 7:2 (as above) and then: 



V '^i^i-.k^.^, k) mm <^ 1, — — — ^ 



7ri(i9i:fc,„,,,A:) 

ei = sup — — p- 

the constant in the minorization condition is then e^cj). 

The bound on total variation distance, R, suggests a much faster rate of convergence for the population 
algorithm: Mq.oi = 25326 (the number of iterations to achieve a bound on the total variation distance 
less than 0.01) for the vanilla algorithm and Afo.oi = 7660 for the population algorithm, that is, it is 
significantly faster. 

This example demonstrates that a simple extension of the original algorithm to include an auxiliary 
distribution that provides a good proposal (for the original target) allows efficient movement around the 
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state space. That is, it reiterates the point in remark 1 to Theorem [TJ that the fast convergence rate of 
one of the chains is propagated through the system. 

4.3 Summary 

A point of interest, raised by a referee, is that we found as we increased N we found it difficult to find 
a Markov kernel that could improve upon rate of convergence to stationarity when N = 2. This can be 
explained as follows; suppose, for illustration, we run N parallel chains with the same (marginal) invariant 
measure TT{x)X{dx) with kernel Km ° Ke o Km in the proof of Theorem[T] Assume that all the mutation 
kernels, except one, are strongly aperiodic [hq — 1) and mix almost perfectly (that is, are uniformly ergodic 
with e w 1). Then we can easily establish that the constant in the minorization condition is ^ as 
N oo, so that R Ki 1. This establishes two points: 

• Firstly, that continually extending the state-space will inevitably lead to slower convergence of the 
Markov chain, unless very efficient population moves may be constructed. That is, we should not 
naively extend the state-space and expect our Markov chain to converge more quickly; unlike con- 
vergence of particle algorithms (Del Moral, 2004), there is not necessarily an improved convergence 
property as cxd. This is an open problem for future research. 

• Secondly, that in cases where R is not informative on the rate of convergence, it may be more beneficial 
to investigate other properties of the Markov kernel, such as the spectral gap. Such an analysis is 
likely to be far more involved than discussed here; see Madras & Zheng (2003) for example. 

5 Population Moves for the Mixture Example 

Now that we have established, for difficult problems, that population methods can lead to faster conver- 
gence, we discuss how to implement population moves for our mixture example fSection l2.3p . Our notation 
is such that di — {rj^ , , '^'^ , ki) , i G Tat and [rf ,w^) refers to all of the component specific parameters and 
weights for chain i. In our tempering approach (i.e. tt'-*), we will temper the likelihood terms only, rather 
than the full posterior, to avoid any integrability problems. We now proceed to combine several MCMC 
methods to improve the mixing ability of the chain. 

5.1 Exchange Moves 

An exchange move is used to swap information between two different parallel tempered chains. Our 
strategy is as follows: at iteration t we select two adjacent chains (in terms of the temperature parameter) 
uniformly at random and propose to swap their values. In order to achieve a reasonable interaction between 
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the chains, the temperature ladder is set so that this move is accepted about half of the time (Liu, 2001); 
see Section [5] for further discussion. 

One way to improve this move is to use the delayed rejection method, as suggested by Green & Mira 
(2001). At iteration t, we select any two chains ii and i2 to swap, accepting or rejecting with the usual 
Hastings ratio, that is, with probability 



where the labelling of the chains is with respect to the current state of the chain and d[.j^ denotes the 
new configuration of chains. If this is rejected, we select two adjacent chains 13 and ^4 to swap, denoting 
this configuration 0'{.j^. To ensure reversibility with respect to the target, as part of the delayed rejection 
method, we construct a pseudo move which consists in starting from the second stage proposed state 0".^, 
proposing to move to d^.j^ (which swaps d'/_^ and 6'/^) and rejecting it with probability /Oi(0'/.^, 0J.^). The 
second stage move is accepted with probability 



This move allows for increased interaction within the population. At the first stage, we allow any pair 
of chains to be swapped, thus if a state of a chain at a high temperature is consistent with one of the 
distributions at a lower temperature, it is allowed to quickly jump down the ladder. 

5.2 Crossover Moves 

Liang & Wong (2001) employ various crossover moves in an evolutionary MC algorithm; the objective is 
to increase the interaction within the population. In our algorithm, we use two move types: 

I. Variable dimension crossover 

To construct a move likely to have high acceptance probability in the mixture model, we begin by reassigning 
mixture component labels in each chain so they satisfy an ordering constraint on the weights, in order 
approximately to match the labels of components in different chains. When in state 9, we select a variable 
dimension crossover with probability v{9); 



Note the case v{9) = corresponds to a 'do nothing' move. We select a pair of chains with differing 
dimension with probability inversely proportional to the squared difference between the dimensions. We 
then propose the new state of the population members, by swapping /c's and the weights. We take the 






otherwise. 



1 ii ki =^ kj for some i ^ j 
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lowest weighted component specific parameters of the higher dimensional chain to the lower dimensional 
chain, i.e. if fcjj > ki^ for the selected chains ii,i2 we propose: 

where rjj^ denotes an element of 77*^ . The acceptance probability is easily calculated and thus omitted. 
After the move has been accepted or rejected (or the do nothing move) we propose (and accept) a random 
permutation (all permutations have uniform probability of being proposed) of the labels of the parameters 
(of all the chains). This final permutation ensures invariance with respect to the target. 

II. Fixed Dimension Crossover 

We begin by reassigning the mixture component labels in each chain by ordering on the first dimension of 
the means. When in state 6, we select a fixed dimension crossover with probability 1, if it can be selected 
(i.e. there are at least two chains with the same dimensionality) otherwise we select a 'do nothing' move. 
Select a pair of chains {11,12} with the same dimensionality, with probability 

p{ii,i2\e) oc \Cn-Cn\-%,,=k,,- (5.2) 

We select a position j = 1, ... ,ki^ — 1 to crossover, this selection made with probability proportional to 
1/j and switch all component specific parameters to the left of j inclusive (note that if the identifiability 
constraint is not satisfied in the proposed state of the chain we immediately reject) and accept/reject on 
the basis of the Hastings ratio. After the accept/reject (or the do nothing move) decision has been made, 
we again propose a random permutation of the labels of the parameters. 

5.3 Snooker Jumps 

One of the most important ways we can use the information in the population is by targeting variable 
dimension jumps by using another chain. This idea is linked to the snooker algorithm of Gilks et al. (1994) 
and is performed in the following way. When in state 9, we select a birth with probability b{6), where 

1 if fcj = 1 V i 

b{e) = l ifki = k^^yi 

1 /2 otherwise 

then select a chain (the current point 9c) for which a birth is possible (let mi,{9) be the number of chains 
such that a birth can occur when in state 9) with uniform probability, and select an anchor point {9a) with 
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probability inversely proportional to the absolute value of the difference between the inverse temperatures. 
We then generate w Be{l, k^), with Be[-, ■) the beta density, and draw a new /i, A pair from: 

fea 

q{^L,K) = £/i(r;,)A/;(M^^a)IW(2r + 3,Ap 

i=i 

where 

k 

1=1 

and h{-, ■) is the Mahalanobis distance. We then perform the rest of the move as for the birth in Appendix 
1. In the death move, we perform much the same as for Appendix 1, except we select a current point with 
probability l/md{9) (where md{0) is the number of chains for which a death can occur when in state 9) 
and (redundantly) select an anchor point (which is used in the reverse birth). The birth move is accepted 
with probability min{l, A\ with: 

P(^/l:r^|^?^wJ.J.^_pfcc)^=p(fcc) k^l 

(fee + l)b{e)mdi9') Be{w; 1, k,)qi^l, $) 

where $ is the Cholesky decomposition of A (see Appendix 1 for details), p(2/i:n|?7'^, wj.j. _i,fcc)^'= is the 
tempered likelihood (for the current point), _B(-, •) is the beta function and Be{x; •, •) is the beta density 
evaluated at x. The objective of this move is to propose new component-specific parameters which are 
likely to be consistent with the data, but are markedly different from the current components. It also 
provides an adaptive element to the birth proposal, as it relies on current information, in the population, 
that is being continuously updated. 

5.4 Constrained Sampling 

One aspect of population-based simulation that is apparent is the need to maintain diversity of the popu- 
lation (as in sequential Monte Carlo - see Del Moral et. al. (2006)). In many cases for which it is difficult 
to traverse the state space, it is often the case that the chains at lower temperatures can become trapped 
(stuck in local modes) as for single chain MCMC methods, and this may lead to inaccurate Monte Carlo 
estimates of quantities of interest. To avoid this problem, we propose to constrain some of the members of 
the population, that is, for some subset Ti;n (' > 2), and i e Ti-.n, t^i is a density constrained to Ei C E. 
In the setting of trans-dimensional problems, a natural choice of sets Ei may be selected with respect 
to the model dimension (e.g. Ei — UfceK {{^} ^ f^'' K,i d K. and i E Ti-.n)- In general, choosing an 
appropriate Ei is challenging; see Atchade & Liu (2004) (unpublished technical report) for some discussion. 
We remark that, in the example in Section [SI this technique will prove to be very important. 
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5.5 The Algorithm 

To sample from the augmented distribution we use the algorithm below; we use the genetic algorithm 
terminology of Liang & Wong (2001). 

0. Initialise the chain 9. 

For t = I, . . . , M sweep over the following: 

1. Mutation. Select a chain i G {1, . . . , N} with probability Ti and then perform one sweep of the reversible 
jump algorithm in Appendix 1 for this chain (with appropriate modification for constrained targets). 

2. Make a random choice between performing steps 3 or 4- 

3. Crossover. Propose a variable dimension crossover move with probability 1/2, else propose a fixed 
dimension crossover. 

4. Snooker Jump. Propose a birth with probability b{6), else propose a death. 

5. Exchange. Perform the delayed rejection exchange move. 

Note that, for constrained chains, we only allow them to be involved in fixed dimensional crossovers and a 
special exchange move that augments step 5; we propose to exchange a constrained and non-constrained 
chain, selecting the move only if such a move may be performed (that is ki G ICj and kj G ICi for i ^ j). 
All selections are made with uniform probability and no delayed rejection is used. 

6 Gene expression example revisited 
6.1 Specification of Simulation Parameters 

Population Size : To run the population algorithm in Section 15.51 we used = 25 with 5 chains 
constrained. We recommend a large population size in general (although the discussion in Scction r4.3l should 
be considered), so that results are reasonably similar for separate runs of the algorithm. See Jasra, Stephens 
& Holmes (2006) (unpublished technical report, available from http://stats.ma.ic.ac.uk/das01/) for 
more guidance. 

Temperature Parameters: For the main population (the unconstrained chains) the following inverse 
temperatures were selected: 

Ci = 1 

for constants > 0, iy9 > 1. We selected <; = 10^^, ip — 1.85 (C20 — 0.74). Our choice of cooling schedule, 
and population size was based upon pilot tuning. We selected a slowly decreasing sequence of C's since 
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we observed a poor acceptance rate for the exchange move for distributions that were further away from 
each other, as expected. We found that the inverse temperature at which the reversible jump algorithm 
performed best (that is, reasonable acceptance rates along with regions of high support that were similar 
to the target) was C = 0.75; thus we attempted to include a distribution with this temperature. We note 
that we need to be careful when specifying temperatures, since for low temperatures (low depending on 
the problem at hand) the distribution starts to favour dimensionalities that are small, although this may 
be alleviated by specifying priors for k which penalise small values. For this particular problem, we spent 
at least 2-3 hours in tuning the parameters; a more automatic procedure may be implemented - see below. 

The term 'reasonable acceptance rate' deserves a quantitative definition. A useful criterion put forward 
by Iba (2001) is that the expected (wrt target) log Hastings ratio of the exchange probability is equal to 
1. This in turn means that, approximately, the algorithm will accept the exchange about half the time. 
We remark that such an approach may be used to provide an automatic temperature selection and will 
vastly reduce the time spent on tuning the algorithm (as noted by a referee this aspect should be taken into 
account in the computational cost and comparison of the algorithm). We refer the reader to Iba (2001) for 
the details and to Goswami & Liu (2006) (impublished technical report) for an alternative approach. 

Constrained Targets: To select the Ei, i T2i:25 we used a pilot simulation. We ran the algorithm with 
N = 25 and the inverse temperature parameters discussed above, only 9 chains in the main population 
and 16 constrained chains (given inverse temperature parameter 0.999). We selected the subspaces Ei 
with respect to the dimensionality, that is we had 10 chains constrained to lie k G Ti-2, ■ ■ ■ , T'ig:2o then six 
other chains constrained to lie in k G T3:6, T6:9, Tg:i2, Ti2:i5 and Tis^ig. This was adopted in order to 
determine whether there was any support outside T3:ii found in Section [2.31 Based upon a short run, the 
five constrained chains were taken as T2-a, T^-Mj "^s-.i a-nd T7:g, Tg^n. The idea of the pilot tuning is to avoid 
wasting CPU time on population members constrained to lie in areas of support that have low density with 
respect to the original target density. Additionally, the constrained chains need to be able to interact with 
the main population and the preliminary tuning allows us to make this choice. The inverse temperature 
parameters for the constrained chains were 0.999, since we seek to maintain diversity with respect to the 
population at colder temperatures. For further discussion in the setting of partitions, especially in the 
context of reversible jump, see Atchade & Liu (2004) (unpublished technical report). 

The algorithm in Section 15.51 was run for 1 million sweeps which took approximately 9^ hours (code 
is available on request from the first author). A minor modification to algorithm was made: that if a 
crossover move was selected, we also propose an exchange for the chain of interest. 
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6.2 Comparison with Vanilla Sampler 

The improvement over the vanilla sampler is substantial, on average, the chain of interest took 6.75 sweeps 
to jump between a mixture model with less than 4 components to a mixture with more than 6. 

The observed inability of the vanilla reversible jump algorithm to move around the state space (from 
k e T3:5 to A: G Tg^ii) is not present for the population sampler, since it may represent both parts of the 
space simultaneously. Note that, due to the complexity of the target, we cannot claim that the sampler 
has converged; there may be regions of high posterior density that are still unexplored. 

It is apparent, from simulations, that despite the substantially improved performance when compared 
to the vanilla sampler, the sampler has missed the region k E {8, 9, 10}. This is due to the reasons discussed 
earlier, and demonstrates that constrained chains can help guard against such problems. 

The effective sample size (ESS) is a standard measure of the relative efficiency of an MCMC sampler (see 
Liu (2001) for full definition). For our population algorithm, the ESS for k was 59745 (60000 samples, using 
a lag of 10 in the autocorrelation calculation), compared with 2998 and 3591 for both vanilla algorithms. 
Taking 



2 ESSpop , 
M T 

-^»^^pop-^ pop 



ESSvani ESSv 



where the subscripts refer to the population, vanilla algorithm runs 1 and 2 (the chains run in Section 
12. 3p and chains respectively, M is the sample size and T is the CPU time, we obtain E = 2.07 (wc 
note that similar conclusions are drawn when taking into account tuning time). Therefore there is little 
contest between using population-based reversible jump and the vanilla counterpart for this example; the 
population approach is far superior (note that all coincidental simulation parameters are the same between 
algorithms) . 

6.3 The Efficiency of Sampler Moves 

The exchange move was accepted 44% of the time at the first stage and 75% at the second. This indicates 
that delayed rejection helps to ensure that the algorithm is constantly swapping information between the 
chains; for 86% of the sweeps there is at least one exchange. 

The snooker and variable dimension crossover moves have acceptance rate less than 1%. That this 
occurs is to be expected. Liang & Wong (2001) report fairly small acceptance rates for their crossover 
moves; our rates are smaller as our algorithm operates upon are a more complex space. Our experience 
with the snooker and variable dimension crossover in more simple examples, is that they are generally 
not worth the extra coding effort given their performance. However, we were satisfied with the fixed 
dimensional crossover which was accepted 2.9% of the time. The snooker birth is accepted more often than 
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the standard birth, but the reverse snooker death move is rarely accepted (c.f. a birth move with a proposal 
that has low variance). Hence the move is less successful overall than the standard birth. The variable 
dimension move acceptance rates (averaged over all chains) were still below 1%, with the split/combine 
move being less effective. 

Overall, our recommendation is that the exchange (with delayed rejection) and fixed dimensional 
crossover are implemented. This is in conjunction with constrained targets. 

6.4 Comparison with Simulated Tempering 

A more appropriate single chain sampler to compare with the population method is a simulated tempering 
algorithm; see Hodgson (1999) for an example of another variable dimension simulated tempering algorithm. 
Here the target distribution is: 

TTie,C,k\y,.,n) (X p{iji..n\0,k)'^pie,k)p{O (6.3) 

that is, C G ■Z^ is stochastic with Z finite and pseudo prior p{Q (in this example we set the prior as opposed 
to constructing one adaptively; see Geyer & Thompson (1995)). Note that our formulation does not require 
any normalization constants to be known, although such an algorithm will allow us to find a 'good' pseudo 
prior; see Geyer & Thompson (1995). 

To sample from (16.31) we use the reversible jump algorithm in Appendix 1, conditional upon C, and to 
update ^ a delayed rejection move was adopted: propose a temperature uniformly at random from Z and 
accept or reject with the Hastings ratio - if rejected, select an adjacent temperature and perform a pseudo 
move that selects to move from the proposed temperature at the second stage to the proposed temperature 
at the first stage. 

Using some trial simulations we were unable to find a pseudo prior so that for a reasonable number of 
temperatures (e.g. 25), the algorithm could jump between the target and the inverse temperature 0.75. As 
a result, for any sensible number of distributions, the performance of this approach only slightly improves 
over the vanilla algorithm. An example of a run of the simulated tempering algorithm was setting p(C = 
C,i) oc 1/i, with (,1 — 1 and having a difference of 1 x 10^* between each temperature. We found, with 
\Z\ — 25, that the algorithm only visited the distribution of interest 10% of the time in a run of 250000 
sweeps. 

In this Section, it was seen that simulated tempering has been difficult to set up, so that it can 
be operated efficiently; see Atachde & Liu (2004) (unpublished technical report) for a more automatic 
procedure. In addtion, see Zheng (2003) for a theoretical comparison of population MCMC and simulated 
tempering. 
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7 Discussion 

To summarize, in our experience a vanilla reversible jump algorithm often fails to explore the support 
of a multimodal model space. We introduced population-based reversible jump MCMC and gave some 
theoretical justification for why these methods can be preferable to standard MCMC methods. In addition, 
it was demonstrated that population-based reversible jump is a means to improving variable dimension 
simulation. Overall, our method helps open up the possibility of fully Bayesian analyses in problems for 
which simulation is prohibitively slow. Note that the basic algorithm (without population moves) can 
easily be coded given a vanilla sampler. Therefore population MCMC provides a simple way to check the 
performance of MCMC algorithms. 

One of the problems of our approach is the limited amount of success of our crossover moves. Whilst 
this was observed for simpler problems in Liang & Wong (2001), we would still hope that the population 
can provide more information when proposing moves. General guidelines for constructing constrained 
subspaces of E and finding efficient ways to make them interact with the population is an important area 
for further investigation. There are many potential methodological extensions that may be considered. 

Firstly, to combine our approach with adaptive MCMC methods (see Andrieu & Moulines (2006) and 
Chauveau & Vandekerkhove (2002) in the population context). This is likely to be superior to standard 
adaptive algorithms, since there is more information to update proposals. Furthermore, there is more 
information in terms of where the chain has not been, i.e. we may search (fewer) regions of the support of 
TT for states with high density. 

Secondly, we may consider combining some of our ideas with many recent stochastic simulation algo- 
rithms. For example, we might use the constrained chains in a similar context to the equi-energy sampler 
of Kou et al. (2006). In the trans-dimensional case, the energy rings could be replaced with dimension 
rings. This is likely to produce a highly diverse sample with respect to the dimensionality. 
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Appendix 1: Reversible Jump Sampler 

The vanilla sampler used in Section [2.31 is now outlined. 
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One of the drawbacks of the model we have selected is the need for A to be positive definite. As a 
result, moves in MCMC simulation will be difficult to construct such that this constraint is satisfied. To 
deal with this problem we consider the Cholesky decomposition $ (see Dellaportas & Papageorgiou (2006) 
for an analysis using the spectral decomposition). That is, A = where $ is lower triangular with 
positive diagonal elements (recall the Jacobian is T" 115=1 </'h~'^^)- ^^'^ RJMCMC algorithm is as follows 
(all moves are Metropolis-Hastings steps unless otherwise stated). 

Firstly the fixed dimensional moves. The component specific means (/ij) and component specific lower 
triangular part of (E>j are both updated via an additive cauchy random walk, independent in each dimension. 
The component specific diagonals of $j are updated via a multiplicative log-normal random walk, indepen- 
dent in each dimension. The weights are proposed using an additive normal random walk on the logit scale. 
Finally, * is generated using a Gibbs kernel; the full conditional is >V(2(5 + A;a'), (2/i + 2 

Secondly a birth/death of a component, largely following Richardson & Green (1997). Briefly, we draw 
a new and $ from the prior and w Be(l, fc), setting the new weights as {w\{\ — w), . . . ,Wk{l — w),iu), 
selecting the move with probability bk (when in state k). The death, selected with probability dk, is 
performed by selecting a component to die with uniform probability and inverting the jump function. 

Finally, a split /combine of a component. We select a split with probability Sk and choose a component 
j* uniformly at random to split into components labelled as {ji , 72) • The split requires the following actions: 

(i) Split the weight by drawing ui ~ Se(7, 7) and set 

(ii) Split the mean vector by drawing ^1(2), . . . , Wr(2) ^ -^(0) ct/*) and take 

MiOi) = MiO*)+w;(2) 

(iii) Split the off diagonals of $ by drawing ^21(3)! • • • > Wj.(r-i)(3) A/'(0, a^) and take 

4>lm(ji) = 4>lm(j-) + UimCi) 
4'lm{h) = (t'lmij") - Ulm(3) 

where Z = 2, . . . ,r, m = 1, . . . , / — 1. 

(iv) Split the diagonals of $ by drawing 'Uii(3), . . . , ^^.^(s) ~ CAf{Q, a) and take 

^iiih) = - — ^11(32) = (l^ii{3*)'^im- 
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In order to combine we select the move with probabihty Cfe and invert the jump function above. We 
note that due to the symmetry constraint imposed on the jump function it does not matter which way 
we combine the components (see Cappe et aL(2003) for details on this). We choose two components to 
combine, when in state k, with probability inversely proportional to the Mahalanobis distance between 
them, that is: 

Pk Ui ,32) tx - )' {^ij, - fij^ ) + (^j, - ^ij, )' {fij^ - )] "\ 

The split in state k is accepted with probability min{l, A} where 

A = (hkelihoodratio)^%M£^^!^^^ 

p{k + 1) (fc + 1)! kck+iPk+i{ji,j2) \J\ 

p{k) kl Sk 2qi(wi)g2(u2)93(u3) 

where | J| is the Jacobian: 

and obvious notation for the prior and proposal densities. 

The algorithm is performed in a deterministic sweep over all fixed dimension moves followed by a 
random choice of birth/death or split/merge. The particular trans-dimensional move is selected with 
uniform probability (assuming we allow a move, i.e. no birth or split when k = fc,„ax or death or combine 
when fc = 1). 

Appendix 2: Proofs 

Proof of TheoremUl ■ Our strategy is to show that k^-^{xi.,j^, A) = {Km oKeo Km)^~^{xi.,n,A) {A e 
S-^) is a uniformly ergodic Markov kernel. Then to prove K^(^~^^ is uniformly ergodic, we may use the 
fact that K3(^-i)(a;i:Ar,^) > t^^^-^^I - t)^-^K^-^{xi..n,A) (proved below). We begin by proving the 
case N = 2 and then use an induction on N. The strategy of the proof is to use the uniform ergodicity of 
Kj* (which we take to be Kn) and the acceptance of an exchange move. We assume uq = 1 as the proof 
can be extended to the case no > 1 with only notational complications. We denote xf"^ as the value of Xi 
after I steps. 

We first establish K^^^-^^xi-.n , A) > r2(^-i)(l - t)^~^K^-^{xi.,n, A), A e 5^. Consider K^, then: 

K^{xi..N,A) = [ K{xi..N,dx%) [ k{x%,dx^^^j,)k{x%,A) 
Jen Jen 



> 



K{xi:N,d,x[^^j^) / K{x[^]^, dx[%)rKMix[%,A) 

EN Je" 
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where we have appUed Chapman-Kolmogorov and used K(xi-2, •) > tKm{xi:n , •) V xi;n G . We then 
have 



K'\x^.,N.A) > t\1-t) KM{xi..N,dxY.'N) KE{x\'.>j,.dx\'>^)KM{x\'.'N.A) (7.4) 

which corresponds to selecting a mutation followed by an exchange and then followed again by another 
mutation. This clearly follows for K^'^'^~'^^ {xi-n , A)^ by the argument above: 

K'^^-'\x,.,^^A) > r^(l-r)/ K'(^-'Hx,..^,dxl^^-'^)K{xl^^-'\A) 
> rHl-rf f K'^''''\x,..^,dxl^^''^)K\xl^i^-'\A) 

and we may thus apply the argument above, recursively, to demonstrate the result. We now drop the 
T^(l — r) and prove uniform ergodicity of K. 

Let N = 2, A = Ai x A2 e £ ® E. Using the fact that Km{xi:2, •) ~ Ki{xi, ■)K2{x2, •) and applying 
the minorization condition for K2{x^2 \ A2), the modified equation (|7.4p becomes 

K{xi..2,A) > [ Ki{xi,dx^P)K2{x2,dx'i^)x 
Je^ 

/ 5^m{dx\>)6^ii,{dx\>)T^xnll, 
JE^ 1 ^ L t:i{x\ )tt2{x)2 ') 

K^{x^^\A^)ev{A2) (7.5) 

where we have ignored the rejection of an exchange move. Since 

. /. 7i-i(a;2)7r2(a;i) \ . / ni{x2) \ , \ ^ jp2 fn a\ 

minO, — - — r — - — -\ > mm<^l, — - — - — ^ V (a;i, ^2) £ (7.6) 
1^ ■Ki[xi)T:2[X2j J L •^2(a;2)£'i J 

and using the measurability of the function, we can split the integrals in equation ()7.5|) into Ii x I2, where: 

h = I Ki{xi,dx['^) f J (i,(d4") 



X 



IE 1 

I2 ^ f if2(a;2,d4'^)'5^a)(d4'Vin(l,^^i%^Wi(4'\^i)- 

Clearly /i = 1. For I2, integrating with respect to Dirac measure and then applying the minorization 
condition we obtain: 



I2 > e 



Kdx«) min (1, ^i^||i^Ui(x«, Ai). 



E I 'n-2ix'^^)gi 



Therefore equation (|7.5p becomes: 

Kixi..2,A) > 01^* (A) (7.7) 
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where 



v*{A) = K*{A,MA,) 



J i'((ia;2) min 1 1, 



7^2 (2^2 

7ri(x2) 



Since (|7.7p holds V xi;2 eE^,VA££i^£ and the product measure is non-trivial, we have that is 
(1,0,1^*) small, similarly K'^ is uniformly ergodic. 

Now suppose K(x2:N, dx2:N) is uniformly ergodic (for arbitrary mixture parameters) under the specified 
conditions with E'^"^ {{N - 2), e, i^) small. Let 9i = J2iL2 ^i'- For A = Ai x ■ ■ ■ x An ^ A^n e we 
have that: 

if(^-i)(xi:jv,^) > {l-0,f-^ f k{xi..N,dx'^^l,)Ki''-^\xi%,A2..N)x 

kI''~\x['\a,) (7.8) 

where Kg-^ denotes that the mixture parameters in the exchange kernel have been modified by 9i . Equation 
(j7.8p refers to the fact that the probability of moving to A under K for any x e E^ is greater than 
considering the kernel that updates xi independently of X2:N (i-e. we can apply much the same arguments 
as for demonstrating K^'^^'^^xi-.n , A) > r2(^-i)(l - r)^-^ K^-^xi-.n , A)). We then have: 

i?(^-i)(xi:jv,^) > e{l~eif-' f K{xi..N,dx^^.lMA2..N)K^,''-Hx^i\A^). 

At this point we may apply the above arguments to yield that E^ is (A^— 1, 9, ly*) small for appropriate 9 > 
and probability measure v*. Thus the result follows by induction and the fact that K'^'^^~-^\xi;n, A) > 

Proof of Proposition^^ Our approach is to focus upon the Dobrushin coeficient of Km ° Ke- Consider 
N = 2, f -.E^ -^[0,1]: 

Pl,2{^l,X2)f{x\.2)5xi{dx'2)5x2{dx\)KM{ui,2,dxi,2) 

E* 



\Km o KE{f){u^:2) - Km o KE{f){vi..2)\ < 

Pl,2{xi,X2)f{y'i,2)5xt{dy2)Sx2{dy[)KM{vi:2,dxi:2) 



E^ 



E-i 



[1 - Pi,2(2;i,a;2)]/(a;'i:2) 



5xi{dx'^)5x2{dx'2)KM{ui;2,dxi;2) - / [1 ~ Pi,2{xi,X2)]f{x[.2)6xAdy'i)Sx2{dy'2)KM{vi:2,dxi;2] 

J El- 

< [osc(/pi,2) + osc(/[l - piMPiKn) 

< 2osc{pi, 2) P (Km) 
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where we have used the fact that osc{fg) < sup(/)osc(gf) (since inf(/) = and sup(/) = 1). Note: 

0Sc(pi,2) = 1- inf pl,2{Xl,X2)- 

Now, since we have the mixing condition on Km we have that: 

P{KmoKe) < 2(1- inf pi,2{xi,X2)){l-e). 

(xi,x2)eB2 

Since 

\\r]{KMoKEr-n*\\Tv < P{{Km o KEr)\\rj - n*\\Tv 

(e.g. Del Moral (2004) chapter 4) we easily yield the desired result, for N = 2. The result for A/' > 3 can 
be proved using the above arguments, with only notational complications. □ 
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